Optimal Multistage Algorithm for Adjoint Computation
We reexamine the work of Stumm and Walther on multistage algorithms for adjoint computation. We provide an optimal algorithm for this problem when there are two levels of checkpoints, in memory and on disk. Previously, optimal algorithms for adjoint computations were known only for a single level of checkpoints with no writing and reading costs; a well-known example is the binomial checkpointing algorithm of Griewank and Walther. Stumm and Walther extended that binomial checkpointing algorithm to the case of two levels of checkpoints, but they did not provide any optimality results. We bridge the gap by designing the first optimal algorithm in this context. We experimentally compare our optimal algorithm with that of Stumm and Walther to assess the difference in performance.
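For readers unfamiliar with the single-level result mentioned above, here is a minimal sketch of the binomial bound usually associated with Griewank and Walther's checkpointing: with c memory checkpoints and at most r forward recomputations of any step, a chain of up to C(c + r, c) steps can be reversed. The function names are hypothetical, and the sketch covers only the classic one-level, zero-I/O-cost setting, not the two-level algorithm of the paper.

```python
from math import comb


def max_reversible_steps(checkpoints: int, repetitions: int) -> int:
    """Binomial bound C(c + r, c) on the chain length that can be reversed."""
    return comb(checkpoints + repetitions, checkpoints)


def min_repetitions(steps: int, checkpoints: int) -> int:
    """Smallest repetition count r whose binomial bound covers `steps`."""
    if checkpoints < 1:
        raise ValueError("at least one checkpoint is required")
    r = 0
    while max_reversible_steps(checkpoints, r) < steps:
        r += 1
    return r


# Example: 3 checkpoints and 2 repetitions cover chains of up to 10 steps.
bound = max_reversible_steps(3, 2)
needed = min_repetitions(11, 3)
```

Here `bound` is 10, and an 11-step chain with the same 3 checkpoints needs one more repetition level, so `needed` is 3.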
Enhancing Virtual Distillation with Circuit Cutting for Quantum Error Mitigation
Virtual distillation is a technique that aims to mitigate errors in noisy
quantum computers. It works by preparing multiple copies of a noisy quantum
state, bridging them through a circuit, and conducting measurements. As the
number of copies increases, this process allows for the estimation of the
expectation value with respect to a state that approaches the ideal pure state
rapidly. However, virtual distillation faces a challenge in realistic
scenarios: preparing multiple copies of a quantum state and bridging them
through a circuit in a noisy quantum computer will significantly increase the
circuit size and introduce excessive noise, which will degrade the performance
of virtual distillation. To overcome this challenge, we propose an error
mitigation strategy that uses circuit-cutting technology to cut the entire
circuit into fragments. With this approach, the fragments responsible for
generating the noisy quantum state can be executed on a noisy quantum device,
while the remaining fragments are efficiently simulated on a noiseless
classical simulator. By running each fragment circuit separately on quantum and
classical devices and recombining their results, we can reduce the noise
accumulation and enhance the effectiveness of the virtual distillation
technique. Our strategy has good scalability in terms of both runtime and
computational resources. We demonstrate our strategy's effectiveness through
noisy simulation and experiments on a real quantum device.
Comment: 8 pages, 5 figures
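As a purely classical, numerical illustration of the estimator behind virtual distillation, the sketch below evaluates Tr(rho^M O) / Tr(rho^M) for a single-qubit state mixed with depolarizing noise. The noise level, the observable, and the function name are assumptions chosen for illustration; the sketch does not implement the circuit-cutting strategy proposed in the abstract.

```python
import numpy as np


def virtual_distillation_expectation(rho, observable, copies):
    """Return Tr(rho^M O) / Tr(rho^M) for M = copies (dense matrices)."""
    rho_m = np.linalg.matrix_power(rho, copies)
    return float(np.real(np.trace(rho_m @ observable) / np.trace(rho_m)))


# Ideal pure state |0><0| and an assumed depolarizing noise strength p.
psi = np.array([1.0, 0.0])
ideal = np.outer(psi, psi.conj())
p = 0.2
rho = (1 - p) * ideal + p * np.eye(2) / 2  # noisy state diag(0.9, 0.1)

Z = np.diag([1.0, -1.0])  # observable; the ideal expectation value is 1

# Estimates for M = 1, 2, 3 copies move toward the ideal value 1.
estimates = [virtual_distillation_expectation(rho, Z, m) for m in (1, 2, 3)]
```

With one copy the estimate is the plain noisy expectation value 0.8; with two and three copies the dominant eigenvector is weighted more heavily, so the estimate moves closer to 1.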
Model Checking Race-freedom When "Sequential Consistency for Data-race-free Programs" is Guaranteed
Many parallel programming models guarantee that if all sequentially
consistent (SC) executions of a program are free of data races, then all
executions of the program will appear to be sequentially consistent. This
greatly simplifies reasoning about the program, but leaves open the question of
how to verify that all SC executions are race-free. In this paper, we show that
with a few simple modifications, model checking can be an effective tool for
verifying race-freedom. We explore this technique on a suite of C programs
parallelized with OpenMP.
Automatic Differentiation for Adjoint Stencil Loops
Stencil loops are a common motif in computations including convolutional
neural networks, structured-mesh solvers for partial differential equations,
and image processing. Stencil loops are easy to parallelise, and their fast
execution is aided by compilers, libraries, and domain-specific languages.
Reverse-mode automatic differentiation, also known as algorithmic
differentiation, autodiff, adjoint differentiation, or back-propagation, is
sometimes used to obtain gradients of programs that contain stencil loops.
Unfortunately, conventional automatic differentiation results in a memory
access pattern that is not stencil-like and not easily parallelisable.
In this paper we present a novel combination of automatic differentiation and
loop transformations that preserves the structure and memory access pattern of
stencil loops, while computing fully consistent derivatives. The generated
loops can be parallelised and optimised for performance in the same way and
using the same tools as the original computation. We have implemented this new
technique in the Python tool PerforAD, which we release with this paper along
with test cases derived from seismic imaging and computational fluid dynamics
applications.
Comment: ICPP 201
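The access-pattern problem described in this abstract can be seen on a small hand-written example. The sketch below assumes a 1D three-point stencil with constant coefficients; it is not PerforAD output, and all function names are hypothetical. Conventional reverse mode yields a scatter (each iteration increments three neighbouring adjoint entries), whereas the reordered gather form writes each adjoint entry exactly once and so keeps a stencil-like structure.

```python
import numpy as np


def stencil(x, a, b, c):
    """Primal loop: y[i] = a*x[i-1] + b*x[i] + c*x[i+1] on interior points."""
    y = np.zeros_like(x)
    for i in range(1, len(x) - 1):
        y[i] = a * x[i - 1] + b * x[i] + c * x[i + 1]
    return y


def adjoint_scatter(yb, a, b, c):
    """Conventional reverse mode: each iteration updates three entries of xb."""
    xb = np.zeros_like(yb)
    for i in range(1, len(yb) - 1):
        xb[i - 1] += a * yb[i]
        xb[i] += b * yb[i]
        xb[i + 1] += c * yb[i]
    return xb


def adjoint_gather(yb, a, b, c):
    """Same adjoint as a gather: each xb[j] is written once by one iteration."""
    n = len(yb)
    w = np.zeros(n + 2)      # w[k + 1] holds yb[k] for interior k, else zero
    w[2:n] = yb[1:n - 1]     # zero padding takes care of the boundaries
    xb = np.empty(n)
    for j in range(n):
        xb[j] = a * w[j + 2] + b * w[j + 1] + c * w[j]
    return xb


rng = np.random.default_rng(0)
x = rng.standard_normal(8)
yb = rng.standard_normal(8)
coeffs = (0.5, -1.0, 0.25)
xb_scatter = adjoint_scatter(yb, *coeffs)
xb_gather = adjoint_gather(yb, *coeffs)
```

Both adjoint loops produce the same values; the gather version has no write conflicts between iterations, which is what makes it parallelisable in the same way as the primal loop.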
Efficient precision simulation of processes with many-jet final states at the LHC
We present a scalable technique for the simulation of collider events with
multi-jet final states, based on an improved parton-level event file format.
The method is implemented for both leading- and next-to-leading order QCD
calculations. We perform a comprehensive analysis of the I/O performance and
validate our new framework using Higgs-boson plus multi-jet production with up
to seven jets. We make the resulting code base available for public use.
Comment: 14 pages, 7 figures, 2 tables
QContext: Context-Aware Decomposition for Quantum Gates
In this paper we propose QContext, a new compiler structure that incorporates
context-aware and topology-aware decompositions. Because of circuit equivalence
rules and resynthesis, variants of a gate-decomposition template may exist.
QContext exploits the circuit information and the hardware topology to select
the gate variant that increases circuit optimization opportunities. We study
the basis-gate-level context-aware decomposition for Toffoli gates and the
native-gate-level context-aware decomposition for CNOT gates. Our experiments
show that QContext reduces the number of gates as compared with the
state-of-the-art approach, Orchestrated Trios.
Comment: 10 pages
ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales
As we enter the exascale computing era, efficiently utilizing power and
optimizing the performance of scientific applications under power and energy
constraints has become critical and challenging. We propose a low-overhead
autotuning framework to autotune performance and energy for various hybrid
MPI/OpenMP scientific applications at large scales and to explore the tradeoffs
between application runtime and power/energy for energy efficient application
execution, then use this framework to autotune four ECP proxy applications --
XSBench, AMG, SWFFT, and SW4lite. Our approach uses Bayesian optimization with
a Random Forest surrogate model to effectively search parameter spaces with up
to 6 million different configurations on two large-scale production systems,
Theta at Argonne National Laboratory and Summit at Oak Ridge National
Laboratory. The experimental results show that our autotuning framework at
large scales has low overhead and achieves good scalability. Using the proposed
autotuning framework to identify the best configurations, we achieve up to
91.59% performance improvement, up to 21.2% energy savings, and up to 37.84%
EDP improvement on up to 4,096 nodes.